Prediction study of electric energy production in an important power production base, China

Xinjiang is an important power production base in China. Its electric energy production must not only meet Xinjiang's own electricity demand but also make up for electricity shortages in at least 19 provinces or cities in China. It is therefore of great significance to know Xinjiang's future electric energy production ahead of time. Accurate production forecasts are imperative for decision makers to develop an optimal strategy that covers not only risk reduction but also the betterment of the economy and society as a whole. Based on the characteristics of the historical monthly electricity generation data for Xinjiang from January 2001 to August 2020, the widely used SARIMA (seasonal autoregressive integrated moving average) method and the Holt-Winters method were applied, for the first time, to model monthly electric energy production in Xinjiang. The results showed that the established SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model had higher prediction accuracy than the established Holt-Winters' multiplicative model. We predicted monthly electric energy production from August 2021 to August 2022 with the SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model, and the errors were very small compared with the actual values, indicating that the model has very good prediction performance. Our study therefore provides a simple and easy-to-use scientific tool for predicting future power output in Xinjiang. Our research methods and ideas can also serve as a scientific reference for predicting electric energy production elsewhere.

From the perspective of changes in power consumption demand, as the level of social informatization continues to improve, electricity will become the most important form of end-use energy. Its status will continue to rise and power consumption will continue to grow, especially with the coming of the information and Internet age. The degree of electrification of the whole society is increasing, and the demand for electric power is rising markedly [1][2][3]. This places higher requirements on the generation capacity of the power industry. Scientific forecasting of Xinjiang's electric energy production is therefore of great significance to the development planning of Xinjiang's power industry. It can help Xinjiang, and the provinces and cities that Xinjiang supplies with electricity, to grasp the power supply situation accurately, make accurate predictions, and arrange electricity demand well in advance.
A common prediction approach is to establish an appropriate model and make forecasts according to the characteristics of the time series data. An important way to analyze time series is to study the statistical laws behind the data generation patterns and to assume that these laws will continue to hold in the future. Many mathematical models can be established to approximate these laws and to make reasonable predictions for variables [4][5][6]. In the 1970s, Box and Jenkins developed a well-established statistical prediction method known as the Box-Jenkins method 5,6. The method includes several models: the autoregressive model AR(p), the moving average model MA(q), the autoregressive moving average model ARMA(p,q), the autoregressive integrated moving average model ARIMA(p,d,q), and the seasonal autoregressive integrated moving average model SARIMA(p,d,q)(P,D,Q)s. The first four models are all special forms of the SARIMA(p,d,q)(P,D,Q)s model. In these models, p is the order of autoregression, q is the order of the moving average, d is the number of ordinary differences needed to make the time series stationary, P is the order of seasonal autoregression, Q is the order of the seasonal moving average, D is the number of seasonal differences, and s is the seasonal cycle. For monthly time series, s is generally 12. In time series prediction, different models are often needed depending on the characteristics of the data. Because Box-Jenkins methods can often achieve high prediction accuracy, they are widely used in time series prediction in various fields 7,8.
Application of Box-Jenkins methods in non-energy forecasting: Ilie et al. 9 pointed out that ARIMA models were suitable for making predictions during the COVID-19 crisis and offered an idea of the COVID-19 epidemiological stage of Ukraine, Romania, the Republic of Moldova, Serbia, Bulgaria, Hungary, the USA, Brazil, and India. Hernandez-Matamoros et al. 10 successfully applied ARIMA models to forecast COVID-19 in many regions. He et al. 11 found that the ARIMA model could effectively predict the short-term positive rate of influenza virus in Wuhan, China. Fanoodi et al. 12 pointed out that ARIMA models were more accurate in predicting the uncertainties in demand than the baseline model used in the Zahedan Blood Transfusion Center. Zheng et al. 13 used the ARIMA model to predict the total health expenditure in China from 1978 to 2022. Liu et al. 14 found that the ARIMA model could be used to predict the seasonality and trend of pulmonary tuberculosis in the Chinese population. Keskin et al. 15 applied the ARIMA model to simulate total electron content and to identify the relationship between earthquakes and radon. Yingzi et al. 16 applied the ARIMA model to predict vehicle speed.
Application of Box-Jenkins methods in energy forecasting: González-Romera et al. 17 found that the ARIMA model could be used to predict medium-term electric energy demand based on the Spanish monthly electric demand series. Parag et al. 18 revealed that ARIMA (1,0,0)(0,1,1) was the best-fitting model for the energy consumption, and ARIMA (0,1,4)(0,1,1) the best-fitting model for the greenhouse emissions, of a pig iron manufacturing organization in India. Aasim et al. 19 put forward an ARIMA model for very short-term wind speed forecasting. Contreras et al. 20 pointed out that the ARIMA model predicted next-day electricity prices well. Kavasseri et al. 21 found that ARIMA models could forecast day-ahead wind speed well. Wang et al. 22 obtained good predictions of U.S. monthly shale gas production using a hybrid of ARIMA and a metabolic nonlinear grey model.
The exponential smoothing method is also a well-established statistical prediction method that is widely used in forecasting research. According to the number of smoothing passes, it is divided into the single exponential smoothing method, the double exponential smoothing method and the triple exponential smoothing method 23. The triple exponential smoothing model was developed by Holt and Winters and is also called the Holt-Winters method; it includes the Holt-Winters' additive method and the Holt-Winters' multiplicative method. Liljana et al. 24 found that Holt-Winters methods gave the best forecasts for long-term heat load forecasting and monthly short-term heat load forecasting at the company Energetika Ljubljana in the Republic of Slovenia. Vincenzo et al. 25 employed the Holt-Winters exponential smoothing method to predict nonresidential electricity consumption in Romania and found that its prediction accuracy was good for the time horizon considered in their study. Guan et al. 26 developed a Holt-Winters additive model and a Holt-Winters multiplicative model for short-term extrapolation forecasts based on monthly reported human brucellosis cases in mainland China. Zhang et al. 27 found that the Holt-Winters method could successfully predict tuberculosis registration rates in Henan Province, China [28][29][30][31].
In this study, we carefully analyzed the trend of historical monthly electric energy production in Xinjiang. According to the characteristics of the data, we built a SARIMA model 4, a Holt-Winters' additive model and a Holt-Winters' multiplicative model 5 to fit Xinjiang's monthly electricity generation. We then compared the fitting and prediction accuracy of these models. Finally, we applied the selected model to predict Xinjiang's monthly power generation from August 2021 to December 2022. Our prediction results can provide a scientific reference for Xinjiang, and for the provinces and cities that need Xinjiang's electric power, to allocate power resources well in advance. Our research methods can also provide ideas for researchers predicting power production in other places.

Data and methodology
Data. In this study, we focus on the prediction and analysis of Xinjiang's monthly electric energy production.
We collected Xinjiang's monthly electric energy production data from January 2001 to August 2022, 260 months in total, from the National Bureau of Statistics of China. Our research area and Xinjiang's annual electric energy production data are shown in the accompanying figure.
SARIMA model. In its standard form, the SARIMA(p,d,q)(P,D,Q)s model can be written as:

$$\phi_p(B)\,\Phi_P(B^s)\,\nabla^d \nabla_s^D\, y_t = \theta_q(B)\,\Theta_Q(B^s)\,\varepsilon_t$$

where $B$ is the backshift operator, $\nabla^d = (1-B)^d$ and $\nabla_s^D = (1-B^s)^D$ denote non-seasonal and seasonal differences, respectively; $\phi$, $\Phi$, $\theta$ and $\Theta$ are the parameters of the model; and $\varepsilon_t$ is white noise with independent and identical distribution. A sparse coefficient model is a special case of the SARIMA model: if some of the coefficients in the SARIMA model are 0, the model becomes a sparse coefficient model. If only the autoregressive part has missing terms, the sparse coefficient model can be written as SARIMA((p1,…,pm),d,q)(P,D,Q)s. The construction of a SARIMA model has four main steps:
Step 1. The SARIMA model is built on stationary time series, so stationarity is an important prerequisite for modeling. The Augmented Dickey-Fuller (ADF) unit root test can be used to test the stationarity of a time series (if the p-value is less than 0.05, the data are stationary). If the time series is non-stationary, it can be made stationary by operations such as ordinary differencing or seasonal differencing.
Step 2. Plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the stationary data, which help to determine the possible values of P, Q, p, and q in the model.
Step 3. After candidate values of p, q, P and Q have been determined, test the significance of the model parameters and calculate the R², the Akaike information criterion (AIC) and the Schwarz criterion (SC) for each candidate model. In their standard forms:

$$\mathrm{AIC} = -2\ln L + 2k, \qquad \mathrm{SC} = -2\ln L + k\ln n$$

where L is the maximum likelihood of the model, n is the number of observations, and k is the number of variables in the model.
Step 4. Plot the ACF and PACF of the residuals and apply the Box-Jenkins Q test to judge whether the model residuals are white noise. If the residuals are white noise, their autocorrelation and partial autocorrelation coefficients lie essentially within twice the standard deviation and the p-value of the Box-Jenkins Q test is greater than 0.05, which indicates that the model fits well and can be used for prediction.
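As an illustration of the sparse-coefficient notation, the model selected later in this study, SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12, can be expanded in backshift notation. This is a sketch under assumed sign conventions (autoregressive terms written as 1 − φB, moving average terms as 1 + θB; some texts write 1 − θB instead). The estimated coefficient values are those reported in Table 2 and are not reproduced here.

```latex
% Sparse AR lags {1,2,3,4,6,7,11}, d = 2, q = 1, P = 1, D = 0, Q = 1, s = 12.
% The sign convention of the MA terms is an assumption.
\left(1 - \phi_1 B - \phi_2 B^2 - \phi_3 B^3 - \phi_4 B^4
        - \phi_6 B^6 - \phi_7 B^7 - \phi_{11} B^{11}\right)
\left(1 - \Phi_1 B^{12}\right)(1 - B)^2\, y_t
  = \left(1 + \theta_1 B\right)\left(1 + \Theta_1 B^{12}\right)\varepsilon_t
```

The autoregressive coefficients at lags 5, 8, 9 and 10 are fixed at zero, which is what makes the model sparse, and because D = 0 no seasonal difference appears on the left-hand side.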
To make the steps of SARIMA model building easier to follow, a flow chart of the procedure is given in Fig. 2.
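The building blocks behind the four steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the software used in the study (which relied on R and EViews). The function names are hypothetical, the Q statistic is assumed to take the Ljung-Box form, and the AIC and SC use the standard definitions −2 ln L + 2k and −2 ln L + k ln n. The ADF test, the PACF, parameter estimation and the p-value of the Q statistic are not implemented here.

```python
import math

def difference(series, lag=1):
    """Return y[t] - y[t-lag]; lag=1 is an ordinary difference, lag=12 a seasonal one."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def acf(series, max_lag):
    """Sample autocorrelation coefficients r_1 ... r_max_lag (series must not be constant)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

def two_sigma_bound(n):
    """Approximate 'twice the standard deviation' band for white-noise autocorrelations."""
    return 2.0 / math.sqrt(n)

def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic (assumed form of the residual Q test); compare with chi-square."""
    n = len(residuals)
    r = acf(residuals, max_lag)
    return n * (n + 2) * sum(r[k - 1] ** 2 / (n - k) for k in range(1, max_lag + 1))

def aic(log_likelihood, k):
    """Akaike information criterion from the maximized log-likelihood and k parameters."""
    return -2.0 * log_likelihood + 2.0 * k

def sc(log_likelihood, k, n):
    """Schwarz criterion from the maximized log-likelihood, k parameters and n observations."""
    return -2.0 * log_likelihood + k * math.log(n)

# Toy example: a quadratic trend needs two ordinary differences (d = 2) to become constant.
toy = [t * t for t in range(6)]
print(difference(difference(toy)))  # [2, 2, 2, 2]
```

In practice, a residual autocorrelation whose absolute value exceeds `two_sigma_bound(n)` at some lag, or a large Q statistic, suggests that the residuals are not white noise and that the candidate model should be revised.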
Holt-Winters' method. Holt-Winters' method is generally more suitable for forecasting and analyzing time series with trend, seasonality and randomness.
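Holt-Winters' additive model has the following expression 31-34 (given here in its commonly used form):

$$
\begin{aligned}
\hat{y}_{t+h} &= \ell_t + h\,b_t + s_{t-m+h} \\
\ell_t &= \alpha\,(y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m}
\end{aligned}
$$

Holt-Winters' multiplicative model has the following expression 31-34 (again in its commonly used form):

$$
\begin{aligned}
\hat{y}_{t+h} &= (\ell_t + h\,b_t)\,s_{t-m+h} \\
\ell_t &= \alpha\,\frac{y_t}{s_{t-m}} + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\
b_t &= \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1} \\
s_t &= \gamma\,\frac{y_t}{\ell_{t-1} + b_{t-1}} + (1-\gamma)\,s_{t-m}
\end{aligned}
$$

where $y_t$ is the observed value, $\hat{y}_{t+h}$ is the h-step-ahead forecast, $\ell_t$ is the level, $b_t$ is the trend, and $s_{t-m+h}$ is the seasonal term. α, β, and γ are the smoothing parameters, m is the number of seasonal periods, and h is the forecast step size. The Holt-Winters modeling process has three main steps: first, estimate the model parameters; second, analyze the fitting accuracy of the model; third, use the Box-Jenkins Q method and the normal distribution plots of the residuals to test whether the residuals are white noise. If the model passes this test, it fits well and can be used for prediction.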
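The Holt-Winters recursions can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions and not the `ets()` procedure used in the study: the smoothing parameters are supplied by the caller rather than estimated, the function name `holt_winters` is hypothetical, and the initial level, trend and seasonal terms are set with a simple heuristic based on the first two seasons.

```python
def holt_winters(y, m, alpha, beta, gamma, h, seasonal="additive"):
    """Return h-step-ahead Holt-Winters forecasts for the series y with season length m."""
    if seasonal not in ("additive", "multiplicative"):
        raise ValueError("seasonal must be 'additive' or 'multiplicative'")
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    mult = seasonal == "multiplicative"

    # Heuristic initial states (an assumption): season means give level and trend.
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    level = first
    trend = (second - first) / m
    season = [(v / first) if mult else (v - first) for v in y[:m]]

    # Update level, trend and seasonal term for every observation after the first season.
    for t in range(m, len(y)):
        s_old = season[t - m]
        base = level + trend  # one-step-ahead level before seeing y[t]
        if mult:
            new_level = alpha * y[t] / s_old + (1 - alpha) * base
            s_new = gamma * y[t] / base + (1 - gamma) * s_old
        else:
            new_level = alpha * (y[t] - s_old) + (1 - alpha) * base
            s_new = gamma * (y[t] - base) + (1 - gamma) * s_old
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
        season.append(s_new)

    # Forecast: reuse the most recent seasonal term for the matching position in the cycle.
    n = len(y)
    forecasts = []
    for k in range(1, h + 1):
        s = season[n - m + (k - 1) % m]
        value = (level + k * trend) * s if mult else level + k * trend + s
        forecasts.append(value)
    return forecasts

# Toy quarterly series with a fixed seasonal pattern and no trend.
toy = [10, 20, 30, 40] * 3
print(holt_winters(toy, m=4, alpha=0.5, beta=0.1, gamma=0.3, h=4))
```

For monthly data such as the series studied here, `m` would be 12. A fitted model would additionally choose α, β and γ by minimizing a fitting error, which this sketch leaves out.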
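The indexes for model comparison. Root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) measure the accuracy of model fitting and are widely used to compare the accuracy of model predictions. The smaller the three values, the higher the fitting accuracy and the better the model performance. In this study, these three indexes are used to compare the performance of the SARIMA model and the Holt-Winters model:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|, \qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|
$$

where $y_t$ is the actual value, $\hat{y}_t$ is the fitted or predicted value, and n is the number of observations.
Data analysis software. In this study, data were analyzed using ArcMap 10.4, R 3.6.2, and EViews 7.0.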
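The three accuracy indexes used for model comparison can be computed with a short standard-library Python sketch. The function names and the toy numbers are illustrative assumptions and are not data from this study; MAPE is returned in percent and assumes that no actual value is zero.

```python
import math

def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# Toy values only, for illustration.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

Evaluating the same three functions on each model's fitted values and on its out-of-sample predictions gives the kind of side-by-side comparison used to choose between models.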

Results
We divided the data into three parts. The data used for modeling were the monthly electric energy production in Xinjiang from January 2001 to July 2020. Data from August 2020 to July 2021 were used to test the prediction accuracy of the models, and data from August 2021 to August 2022 were used to examine the prediction performance of the selected model. The time series used for modeling is plotted in Fig. 3. The plot shows that the series has an obvious trend and randomness. From 2001 to 2010, Xinjiang's electric energy production grew slowly. From 2011 to 2020, it grew rapidly, and the fluctuation of monthly electric energy production increased.
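The three-way split can be checked with a short Python sketch that enumerates the months in each part. The helper `month_range` and the variable names are hypothetical; only the date boundaries come from the text.

```python
def month_range(start, end):
    """Inclusive list of (year, month) tuples from start to end."""
    (year, month), out = start, []
    while (year, month) <= end:
        out.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return out

months = month_range((2001, 1), (2022, 8))                               # full data set
modeling = [ym for ym in months if ym <= (2020, 7)]                      # model building
testing = [ym for ym in months if (2020, 8) <= ym <= (2021, 7)]          # prediction test
evaluation = [ym for ym in months if ym >= (2021, 8)]                    # performance check

print(len(months), len(modeling), len(testing), len(evaluation))  # 260 235 12 13
```

The counts confirm that the 260 monthly observations divide into 235 months for modeling, 12 months for testing and 13 months for the final performance check.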
Modeling analysis of SARIMA model. The SARIMA model takes into account not only the dependence of economic phenomena on time but also the disturbance of stochastic fluctuations in the forecasting process, and it has been one of the most widely used methods in recent years. Because the SARIMA model requires stationary data, we first used the ADF test to check whether the time series from January 2001 to July 2020 was stationary. The p-value was greater than 0.05, which indicated that the original time series was not stationary, so we took the first-order ordinary difference of the data. The ADF test showed that the differenced data were still not stationary. We then took the second-order ordinary difference; the p-value of the ADF test on the twice-differenced data was less than 0.05, which indicated that the data were stationary after the second difference (d = 2, D = 0). The test results are shown in Table 1. From the ACF and PACF of the stationary data (see Fig. 4), the autocorrelation coefficients at lags 1, 5, 6, 12 and 24 were relatively large, so we let q take 1, 5 or 6 and Q take 1. Because the partial autocorrelation coefficients at lags 1, 2, 3, 4, 6, 7, 11 and 12 were relatively large, we let p take 1, 2, 3, 4, 6, 7 or 11, P take 1, and s take 12. Based on combinations of these values of p, q, P and Q, several SARIMA models were established, their parameters were tested, and their R², AIC and SC values were calculated. In the end, only six models passed the parameter tests; the results are shown in Table 2. Model 1 had the smallest AIC and SC. We used the Box-Jenkins Q method to test whether its residuals were white noise; the p-value of the test was less than 0.05, which indicated that the correlation between the residuals was significant.
Therefore, the residuals of Model 1 were not white noise, which showed that the model was not good enough to be used for prediction. Comparing the R², AIC and SC values of the remaining five models, we found that Model 6 had the largest R² and the smallest AIC. The p-value of the Box-Jenkins Q test for Model 6 was greater than 0.05, which indicated that there was no correlation between the model residuals. Furthermore, the ACF and PACF of the residuals of Model 6 were plotted (see Fig. 5). The autocorrelation and partial autocorrelation coefficients of the residuals were almost all within twice the standard deviation, which further indicated that the residuals were uncorrelated at each lag and were white noise. Model 6 therefore fitted well, captured the information in the original data well, and could be used for prediction. The specific expression of Model 6 was SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12.

Modeling analysis of Holt-Winters' model. From Fig. 3, we can see that the time series used for modeling has an obvious trend and that the fluctuation of the data increases over time. We decomposed the time series of Xinjiang's electric energy production from January 2001 to July 2020 using the decompose() function in R. As shown in Fig. 6, the time series has trend, seasonal and random components. Given these characteristics, we set out to build the best Holt-Winters model to forecast and analyze electric energy production in Xinjiang. We used the ets() function in R to find the best smoothing parameters. First, we constructed the Holt-Winters' additive model and obtained α = 0.2418, β = 0.0191, and γ = 0.4914. The Box-Jenkins Q test of whether the model residuals were white noise gave a p-value less than 0.05 (p-value = 0.02). Furthermore, the normal Q-Q plot and histogram of the residuals (see Fig. 7) showed that the residuals did not follow the normal distribution. These results indicated that the residuals were not white noise, that the fitting accuracy of the model was not high, and that the model could not be used to predict Xinjiang's monthly electric energy production. Second, we constructed the Holt-Winters' multiplicative model and obtained α = 0.6204, β = 0.0223, and γ = 0.0001. For this model, the p-value of the Box-Jenkins Q test of the residuals was greater than 0.05 (p-value = 0.66), and the normal Q-Q plot and histogram of the residuals (see Fig. 8) showed that the residuals followed the normal distribution. These results indicated that the residuals of the Holt-Winters' multiplicative model were white noise and that the fitting accuracy of the model was high. Therefore, the Holt-Winters' multiplicative model could be used to predict Xinjiang's monthly electric energy production.

Discussion
In this study, we first established the best SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model according to the characteristics of the Xinjiang monthly electric energy production time series. Table 2 (Model 6) shows that all the parameters of the model passed the test (all p-values were less than 0.05). The autocorrelation and partial autocorrelation plots of the model residuals (Fig. 5) show that the coefficients were essentially within twice the standard deviation, indicating that the residuals of the SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model were white noise and that the model performed well. The fitted curve of the SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model (Fig. 9) essentially coincided with the original Xinjiang monthly electric energy production time series, which indicated that the fitting accuracy of the model was very high. Second, we used the ets() function in R to construct the Holt-Winters' additive model, but the residual test showed that the model residuals were not white noise. The fitting accuracy of this model was therefore not high, and the Holt-Winters' additive model was not suitable for predicting the future monthly electric energy production of Xinjiang. Finally, we constructed the Holt-Winters' multiplicative model. The p-value of its residual test was greater than 0.05, and the Q-Q plot and histogram of the residuals (see Fig. 8) showed that the residuals essentially followed the normal distribution, which indicated that the model residuals were white noise and that the model performed well. When the Holt-Winters' multiplicative model was used to fit the historical data of Xinjiang's monthly electric energy production (see Fig. 9), the fitted curve essentially coincided with the original time series, which indicated that the fitting accuracy of the model was very high. To establish the best forecast model of Xinjiang's monthly electric energy production, we compared the fitting accuracy and prediction accuracy of the SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model and the Holt-Winters' multiplicative model (see Table 3). We found that the SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model had better predictive performance than the Holt-Winters' multiplicative model. Therefore, we applied the SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model to predict Xinjiang's monthly electric energy production from August 2021 to December 2022. Table 4 shows that the errors are relatively small, which indicates that the SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model can predict electric energy production in Xinjiang well. Fig. 10 shows that the monthly electric energy production of Xinjiang from August 2020 to December 2022 follows a fluctuating upward trend, which is consistent with the actual situation. Some studies have found that the prediction performance of a single model was not good and therefore used combined prediction, and their results showed that combined prediction could achieve more accurate results 35. However, a combined prediction model is often more complex and harder to operate in actual prediction work. In our study, three models were built, and two of them were compared for prediction performance. The analysis showed that the single SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model has high prediction accuracy when predicting Xinjiang's electric power output (see Fig. 9). A single model is relatively simple and easier to use in actual predictive analysis.
In this study, although the fitting and prediction accuracy of the SARIMA((1,2,3,4,6,7,11),2,1)(1,0,1)12 model is relatively high, some errors remain. The errors arise because many factors affect electricity production, such as population size, the scale of industrial development, people's living standards, the speed of economic development, and public health emergencies such as COVID-19. We used only historical power production data for prediction and did not consider these other factors, because adding them would increase the complexity of the model and they carry many uncertainties of their own, so they would not necessarily improve the prediction accuracy of the model. Interested readers can pursue this in further research.
Considering that forecasting uncertainty may affect the decision-making process and increase the risk of scheduling, some studies have used interval prediction and obtained better prediction results 36,37. In our next study, we will consider interval prediction in an attempt to find models with higher prediction accuracy.

Conclusions
Electric power plays a vital role in the national economy and people's livelihood, especially during periods of peak electricity consumption. Early prediction of electric energy production can provide a scientific reference for the rational planning and distribution of power. Based on the monthly power output data of Xinjiang from January 2001 to August 2022, this study is the first to construct a prediction model that can predict electric energy production in Xinjiang with relatively high accuracy. Although the methods we used were not complex, the prediction accuracy was very high, which provides a simple and easy-to-use scientific tool for future energy production prediction in Xinjiang. Our research methods and ideas can also provide a reference for other researchers predicting power production in other places.

Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.